DAGE-88: Bug fix on Embedding Model Evaluator #240

Merged
dantuzi merged 3 commits into dataset-generator from DAGE-88_further-investigation on Oct 6, 2025
Conversation

@nicolo-rinaldi (Collaborator) commented Oct 3, 2025

DAGE-88: Jira ticket

The problem was the following: the preprocessing was correct for reranking_task.py but NOT for retrieval_task.py. Looking at the class `AbsTaskRetrieval`, it can be seen that the structure of the string sent to the evaluation process is different. We now have two separate functions in helper.py, one per task. This solves the problem of duplicated embeddings in the cache (visible with the logger in DEBUG mode: the documents were pushed twice).
Another problem came out of the DEBUG logs: when the reranking task is set, the cache contains only the documents that have a score in the candidates.jsonl file. This behaviour makes a lot of sense, since we do not need the embeddings of the whole corpus to compute the required metrics. However, since we want the documents_embeddings.jsonl file to contain the embedding of every doc in corpus.jsonl, we still need to remove name and normalize_embeddings, as in Naz's PR. The embeddings of the documents that do NOT appear in candidates are computed by the SentenceTransformer model linked to the cache while the file is being written to the `resource/embeddings/` folder.
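A minimal sketch of the split (function and field names are hypothetical, not the actual ones in helper.py; it assumes corpus docs are dicts with `title`/`text` fields, and that the retrieval task joins title and body before embedding, so the cached string must be built the same way):

```python
# Hypothetical sketch of the two per-task preprocessing helpers.
# Actual names and fields in helper.py may differ.

def preprocess_for_reranking(doc: dict) -> str:
    # Reranking: a single text field is enough.
    return doc["text"].strip()


def preprocess_for_retrieval(doc: dict) -> str:
    # Retrieval: title and body are joined into one string,
    # matching what the evaluation process expects, so cache
    # lookups hit the same key and embeddings are not duplicated.
    title = doc.get("title", "").strip()
    body = doc.get("text", "").strip()
    return f"{title} {body}".strip() if title else body
```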

@nicolo-rinaldi force-pushed the DAGE-88_further-investigation branch from 984e9dd to 489167b on October 3, 2025, 14:45
@nicolo-rinaldi force-pushed the DAGE-88_further-investigation branch from 489167b to c99ed37 on October 6, 2025, 10:54
@nicolo-rinaldi force-pushed the DAGE-88_further-investigation branch from c99ed37 to 14a14a6 on October 6, 2025, 10:55
@dantuzi dantuzi merged commit 3e46425 into dataset-generator Oct 6, 2025
6 checks passed
@dantuzi dantuzi deleted the DAGE-88_further-investigation branch October 6, 2025 15:56

3 participants